188 research outputs found
Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index
Finding approximate occurrences of a pattern in a text using a full-text
index is a central problem in bioinformatics and has been extensively
researched. Bidirectional indices have opened new possibilities in this regard
allowing the search to start from anywhere within the pattern and extend in
both directions. In particular, use of search schemes (partitioning the pattern
and searching the pieces in certain orders with given bounds on errors) can
yield significant speed-ups. However, finding optimal search schemes is a
difficult combinatorial optimization problem.
Here for the first time, we propose a mixed integer program (MIP) capable to
solve this optimization problem for Hamming distance with given number of
pieces. Our experiments show that the optimal search schemes found by our MIP
significantly improve the performance of search in bidirectional FM-index upon
previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina
reads (with two errors) becomes 35 times faster than standard backtracking.
Moreover, despite being performed purely in the index, the running time of
search using our optimal schemes (for up to two errors) is comparable to the
best state-of-the-art aligners, which benefit from combining search in index
with in-text verification using dynamic programming. As a result, we anticipate
a full-fledged aligner that employs an intelligent combination of search in the
bidirectional FM-index using our optimal search schemes and in-text
verification using dynamic programming outperforms today's best aligners. The
development of such an aligner, called FAMOUS (Fast Approximate string Matching
using OptimUm search Schemes), is ongoing as our future work
A polyhedral approach to sequence alignment problems
We study two problems in sequence alignment both from a theoretical and a practical point of view. For the first time in sequence alignment, we use tools from combinatorial optimization to develop branch-and-cut algorithms that solve these problems efficiently. The Generalized Maximum Trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them is the original formulation of Maximum Trace. The Structural Maximum Trace Problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. For both problems we derive a characterization in terms of graphs which we use to reformulate the problems in terms of integer linear programs. We then study the polytopes (or convex hulls of all feasible solutions)associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. Thisleads to a polynomial time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches.Wir betrachten zwei Sequenz-Alignment-Probleme von einem theoretischen und praktischen Standpunkt aus. Dabei nutzen wir Methoden der kombinatorischen Optimierung, um Branch-and-Cut-Algorithmen zu entwickeln, die diese Probleme effizient lösen. Das sogenannte Generalized-Maximum-Trace-Problem beinhaltet verschiedene Arten von multiplen Sequenz-Alignment in einem einheitlichen Rahmen, darunter auch das ursprüngliche Maximum-Trace-Problem. Das sogenannte Structural-Maximum- Trace-Problem beschreibt den Vergleich von RNA-Molekülen, basierend auf deren Primär- und Sekundärstruktur. Wir leiten für beide Probleme eine graphentheoretische Formulierung her, welche wir dann zur Definition ganzzahliger linearer Programme benutzen. Wir untersuchen die Polytope (d.h. die konvexen Hüllen aller zulässigen Lösungen), die mit den ganzzahligen, linearen Programmen assoziiert sind. Für jedes Polytop leiten wir mehrere Klassen facettendefinierender Ungleichungen her und zeigen, daß für einige dieser Klassen das entsprechende Separationsproblem in Polynomialzeit gelöst werden kann. Dies impliziert unter anderem einen Polynomialzeitalgorithmus zum paarweisen Sequenzvergleich, welcher nicht auf dem Prinzip der dynamischen Programmierung beruht. Darüber hinaus sind die vorgestellten Branch-and- Cut-Algorithmen in der Lage, Probleminstanzen einer Größe optimal zu lösen, die mit Verfahren, welche auf dynamischer Programmierung beruhen, nicht gelöst werden könne
A polyhedral approach to sequence alignment problems
We study two problems in sequence alignment both from a theoretical and a practical point of view. For the first time in sequence alignment, we use tools from combinatorial optimization to develop branch-and-cut algorithms that solve these problems efficiently. The Generalized Maximum Trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them is the original formulation of Maximum Trace. The Structural Maximum Trace Problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. For both problems we derive a characterization in terms of graphs which we use to reformulate the problems in terms of integer linear programs. We then study the polytopes (or convex hulls of all feasible solutions)associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. Thisleads to a polynomial time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches.Wir betrachten zwei Sequenz-Alignment-Probleme von einem theoretischen und praktischen Standpunkt aus. Dabei nutzen wir Methoden der kombinatorischen Optimierung, um Branch-and-Cut-Algorithmen zu entwickeln, die diese Probleme effizient lösen. Das sogenannte Generalized-Maximum-Trace-Problem beinhaltet verschiedene Arten von multiplen Sequenz-Alignment in einem einheitlichen Rahmen, darunter auch das ursprüngliche Maximum-Trace-Problem. Das sogenannte Structural-Maximum- Trace-Problem beschreibt den Vergleich von RNA-Molekülen, basierend auf deren Primär- und Sekundärstruktur. Wir leiten für beide Probleme eine graphentheoretische Formulierung her, welche wir dann zur Definition ganzzahliger linearer Programme benutzen. Wir untersuchen die Polytope (d.h. die konvexen Hüllen aller zulässigen Lösungen), die mit den ganzzahligen, linearen Programmen assoziiert sind. Für jedes Polytop leiten wir mehrere Klassen facettendefinierender Ungleichungen her und zeigen, daß für einige dieser Klassen das entsprechende Separationsproblem in Polynomialzeit gelöst werden kann. Dies impliziert unter anderem einen Polynomialzeitalgorithmus zum paarweisen Sequenzvergleich, welcher nicht auf dem Prinzip der dynamischen Programmierung beruht. Darüber hinaus sind die vorgestellten Branch-and- Cut-Algorithmen in der Lage, Probleminstanzen einer Größe optimal zu lösen, die mit Verfahren, welche auf dynamischer Programmierung beruhen, nicht gelöst werden könne
An exact mathematical programming approach to multiple RNA sequence-structure alignment
One of the main tasks in computational biology is the computation of
alignments of genomic sequences to reveal their commonalities. In case of DNA
or protein sequences, sequence information alone is usually sufficient to
compute reliable alignments. RNA molecules, however, build spatial
conformations—the secondary structure—that are more conserved than the actual
sequence. Hence, computing reliable alignments of RNA molecules has to take
into account the secondary structure. We present a novel framework for the
computation of exact multiple sequence-structure alignments: We give a graph-
theoretic representation of the sequence-structure alignment problem and
phrase it as an integer linear program. We identify a class of constraints
that make the problem easier to solve and relax the original integer linear
program in a Lagrangian manner. Experiments on a recently published benchmark
show that our algorithms has a comparable performance than more costly dynamic
programming algorithms, and outperforms all other approaches in terms of
solution quality with an increasing number of input sequences
05471 Abstract Collection -- Computational Proteomics
From 20.11.05 to 25.11.05, the Dagstuhl Seminar 05471 ``Computational Proteomics\u27\u27
was held in the International Conference and Research Center (IBFI),
Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available
Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization
Background: The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account. Results: We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments. Conclusions: The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested: This is especially the case with an increasing number of inpu
Antilope - A Lagrangian Relaxation Approach to the de novo Peptide Sequencing Problem
Peptide sequencing from mass spectrometry data is a key step in proteome
research. Especially de novo sequencing, the identification of a peptide from
its spectrum alone, is still a challenge even for state-of-the-art algorithmic
approaches. In this paper we present Antilope, a new fast and flexible approach
based on mathematical programming. It builds on the spectrum graph model and
works with a variety of scoring schemes. Antilope combines Lagrangian
relaxation for solving an integer linear programming formulation with an
adaptation of Yen's k shortest paths algorithm. It shows a significant
improvement in running time compared to mixed integer optimization and performs
at the same speed like other state-of-the-art tools. We also implemented a
generic probabilistic scoring scheme that can be trained automatically for a
dataset of annotated spectra and is independent of the mass spectrometer type.
Evaluations on benchmark data show that Antilope is competitive to the popular
state-of-the-art programs PepNovo and NovoHMM both in terms of run time and
accuracy. Furthermore, it offers increased flexibility in the number of
considered ion types. Antilope will be freely available as part of the open
source proteomics library OpenMS
LaRA 2: parallel and vectorized program for sequence–structure alignment of RNA sequences
Background
The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson–Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and infer their possible functions, the structural alignment of RNAs is an essential task. This task demands a lot of computational resources, especially for aligning many long sequences, and it therefore requires efficient algorithms that utilize modern hardware when available. A subset of the secondary structures contains overlapping interactions (called pseudoknots), which add additional complexity to the problem and are often ignored in available software.
Results
We present the SeqAn-based software LaRA 2 that is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. In contrast to other programs our approach can handle arbitrary pseudoknots. As an improved re-implementation of the LaRA tool for structural alignments, LaRA 2 uses multi-threading and vectorization for parallel execution and a new heuristic for computing a lower boundary of the solution. Our algorithmic improvements yield a program that is up to 130 times faster than the previous version.
Conclusions
With LaRA 2 we provide a tool to analyse large sets of RNA secondary structures in relatively short time, based on structural alignment. The produced alignments can be used to derive structural motifs for the search in genomic databases
- …